
    Learning Models over Relational Data using Sparse Tensors and Functional Dependencies

    Integrated solutions for analytics over relational databases are of great practical importance, as they avoid the costly loop data scientists have to repeat on a daily basis: select features from data residing in relational databases using feature extraction queries involving joins, projections, and aggregations; export the training dataset defined by such queries; convert this dataset into the format of an external learning tool; and train the desired model using this tool. These integrated solutions are also a fertile ground for theoretically fundamental and challenging problems at the intersection of relational and statistical data models. This article introduces a unified framework for training and evaluating a class of statistical learning models over relational databases. This class includes ridge linear regression, polynomial regression, factorization machines, and principal component analysis. We show that, by synergizing key tools from database theory (schema information, query structure, functional dependencies, and recent advances in query evaluation algorithms) and from linear algebra (tensor and matrix operations), one can formulate relational analytics problems and design efficient (query and data) structure-aware algorithms to solve them. This theoretical development informed the design and implementation of the AC/DC system for structure-aware learning. We benchmark the performance of AC/DC against R, MADlib, libFM, and TensorFlow. For typical retail forecasting and advertisement planning applications, AC/DC can learn polynomial regression models and factorization machines with at least the same accuracy as its competitors, and up to three orders of magnitude faster whenever the competitors do not run out of memory, exceed a 24-hour timeout, or encounter internal design limitations.
    Comment: 61 pages, 9 figures, 2 tables
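
    One way to see why such in-database aggregates suffice is the ridge linear regression case: the model can be trained from the d x d matrix Sigma = X^T X and the d-vector c = X^T y alone. The sketch below only illustrates that point and is not the AC/DC implementation; the toy data, sizes, and regularization strength are invented, and a structure-aware system would compute Sigma and c directly over the feature extraction query rather than over a materialized matrix X.

```python
import numpy as np

# Hedged sketch: ridge linear regression needs only the aggregates
# Sigma = X^T X and c = X^T y, whose sizes depend on the number of
# features d, not on the number of training rows n. The toy matrix X
# below stands in for the training view produced by a feature
# extraction query; it exists here only to keep the example runnable.
rng = np.random.default_rng(0)
n, d = 1_000, 5
X = rng.normal(size=(n, d))
true_theta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_theta + rng.normal(scale=0.1, size=n)

Sigma = X.T @ X        # d x d aggregate
c = X.T @ y            # d-vector aggregate

lam = 0.1              # ridge regularization strength (invented value)
theta = np.linalg.solve(Sigma + lam * np.eye(d), c)  # (X^T X + lam*I)^{-1} X^T y

print(theta)           # close to true_theta
```

    Because Sigma and c are tiny compared with the training dataset, the expensive, data-dependent work reduces to aggregate computation that can be pushed into the database engine.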

    F: Regression Models over Factorized Views

    ABSTRACT
    We demonstrate F, a system for building regression models over database views. At its core lies the observation that the computation and representation of materialized views, and in particular of joins, entail non-trivial redundancy that is not necessary for the efficient computation of the aggregates used for building regression models. F avoids this redundancy by factorizing data and computation, and can outperform the state-of-the-art systems MADlib, R, and Python StatsModels by orders of magnitude on real-world datasets. We illustrate how to incrementally build regression models over factorized views using both an in-memory implementation of F and its SQL encoding. We also showcase the effective use of F for model selection: F decouples the data-dependent computation step from the data-independent convergence of model parameters and performs the former only once to explore the entire model space.

    WHAT IS F?
    F is a fast learner of regression models over training datasets defined by select-project-join-aggregate (SPJA) views. It is part of an ongoing effort to integrate databases and machine learning, including MADlib [2] and Santoku [1]. Database joins are an unnecessarily expensive bottleneck for learning due to the redundancy in their tabular representation. To alleviate this limitation, F learns models in one pass over factorized joins, in which repeating data patterns are computed and represented only once. This has both theoretical and practical benefits: the computational complexity of F follows that of factorized materialized SPJA views.

    F proceeds in two steps. The first step computes the aggregates necessary for regression and the factorized view of the input database. The output of this step is a matrix of reals whose dimensions depend only on the arity of the view and are independent of the database size. This matrix contains the information necessary to compute the parameters of any model defined by a subset of the features in the view. This step comes in three flavors. The second step converges the model parameters using only this matrix and is independent of the data.

    F's factorization and task decomposition rely on representing data and computation as expressions in the sum-product commutative semiring, which obeys the law of distributivity of product over sum. Results of SPJA queries are naturally represented in this semiring, with Cartesian product as product and union as sum. The derivatives of the objective functions for Least-Squares, Ridge, Lasso, and Elastic-Net regression models are also expressible in the sum-product semiring. Optimization methods such as gradient descent and (quasi-)Newton, which rely on first- and second-order derivatives of such objective functions, respectively, can thus be used to train any such model with F.

    HOW DOES F WORK?
    We next explain F by means of an example: learning a least-squares regression model over a factorized join.

    Factorized Joins
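
    The sketch below is a minimal, hedged illustration of this example rather than F's actual code or data structures; the relations R(a, x) and S(a, z, y), their values, and the learning rate are invented for illustration. It computes the least-squares aggregates Sigma = X^T X and c = X^T y over the join of R and S by exploiting distributivity of product over sum within each join-key group, without materializing the join, and then converges the model parameters by gradient descent using only those aggregates.

```python
import numpy as np
from collections import defaultdict

# Hedged sketch (not F's implementation): two invented relations joined on key `a`.
#   R(a, x)    contributes feature x
#   S(a, z, y) contributes feature z and the label y
R = [(1, 2.0), (1, 3.0), (2, 1.0)]
S = [(1, 4.0, 10.0), (1, 5.0, 12.0), (2, 6.0, 7.0)]

# Group each relation by the join key.
Rg, Sg = defaultdict(list), defaultdict(list)
for a, x in R:
    Rg[a].append(x)
for a, z, y in S:
    Sg[a].append((z, y))

# Data-dependent step: compute Sigma = X^T X, c = X^T y, and the row count n
# of the join of R and S without materializing it. Within a join-key group,
# every x pairs with every (z, y), so sums over the join factor into products
# of per-relation sums (distributivity of product over sum).
Sigma = np.zeros((2, 2))   # feature order: [x, z]
c = np.zeros(2)
n = 0
for a in Rg.keys() & Sg.keys():
    xs = np.array(Rg[a])
    zs = np.array([z for z, _ in Sg[a]])
    ys = np.array([y for _, y in Sg[a]])
    Sigma[0, 0] += (xs ** 2).sum() * len(zs)
    Sigma[0, 1] += xs.sum() * zs.sum()
    Sigma[1, 1] += len(xs) * (zs ** 2).sum()
    c[0] += xs.sum() * ys.sum()
    c[1] += len(xs) * (zs * ys).sum()
    n += len(xs) * len(zs)
Sigma[1, 0] = Sigma[0, 1]

# Data-independent step: converge the least-squares parameters using only
# Sigma, c, and n; the database is no longer needed.
theta = np.zeros(2)
lr = 0.01
for _ in range(5000):
    theta -= lr * (Sigma @ theta - c) / n   # gradient of (1/2n) * ||X theta - y||^2

print(theta)
```

    Materializing the join and summing row by row would produce exactly the same Sigma and c; the factorized computation simply avoids redoing the work for every repeated data pattern, which is the redundancy F eliminates. Since Sigma and c cover all pairs of features in the view, a model over any subset of the features can be converged from the corresponding sub-matrix without touching the data again, which is the basis of the one-pass model selection described above.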